IndicXTREME: A Multi-Task Benchmark For Evaluating Indic Languages
In this work, we introduce IndicXTREME, a benchmark consisting of nine
diverse tasks covering 18 languages from the Indic sub-continent belonging to
four different families. Across languages and tasks, IndicXTREME contains a
total of 103 evaluation sets, of which 51 are new contributions to the
literature. To maintain high quality, we only use human annotators to curate or
translate our datasets. To the best of our knowledge, this is the first effort
toward creating a standard benchmark for Indic languages that aims to test the
zero-shot capabilities of pretrained language models. We also release IndicCorp
v2, an updated and much larger version of IndicCorp that contains 20.9 billion
tokens in 24 languages. We pretrain IndicBERT v2 on IndicCorp v2 and evaluate
it on IndicXTREME to show that it outperforms existing multilingual language
models such as XLM-R and MuRIL.
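As a rough sketch of the zero-shot protocol that IndicXTREME tests, the snippet below fine-tunes a multilingual encoder on English task data only and then runs inference directly on an Indic-language example. The Hugging Face checkpoint id and the three-class NLI setup are assumptions for illustration, not necessarily the paper's exact configuration.

```python
# Hedged sketch of IndicXTREME-style zero-shot evaluation: fine-tune on
# English only, then test directly on an Indic language. The checkpoint id
# "ai4bharat/IndicBERTv2-MLM-only" is an assumption for illustration.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "ai4bharat/IndicBERTv2-MLM-only"  # assumed HF id
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=3)

# ... fine-tune `model` on an English NLI training set here (e.g. with Trainer) ...

def predict(premise: str, hypothesis: str) -> int:
    """Zero-shot inference on an Indic-language example: no target-language
    labels were seen during fine-tuning."""
    batch = tok(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**batch).logits
    return int(logits.argmax(dim=-1))
```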
Naamapadam: A Large-Scale Named Entity Annotated Data for Indic Languages
We present Naamapadam, the largest publicly available Named Entity
Recognition (NER) dataset for the 11 major Indian languages from two language
families. The dataset contains more than 400k sentences annotated with a total
of at least 100k entities from three standard entity categories (Person,
Location, and Organization) for 9 out of the 11 languages. The training
dataset has been automatically created from the Samanantar parallel corpus by
projecting automatically tagged entities from an English sentence to the
corresponding Indian language translation. We also create manually annotated
test sets for 9 languages and demonstrate the utility of the mined training
data on this Naamapadam-test set. We also release IndicNER, a multilingual IndicBERT
model fine-tuned on the Naamapadam training set. IndicNER achieves an F1 score of
more than 80 for 7 out of 9 test languages. The dataset and models are
available under open-source licenses at
https://ai4bharat.iitm.ac.in/naamapadam. (ACL 2023)
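The annotation-projection step described above can be sketched as follows, assuming English BIO tags and word alignments (e.g. from an off-the-shelf aligner) are already computed. The helper below is hypothetical and only illustrates the label-copying idea, not the paper's full pipeline.

```python
# Hedged sketch of the annotation-projection idea behind Naamapadam:
# given an English sentence with NER tags and word alignments into the
# Indic translation, copy each entity label to the aligned target tokens.
def project_tags(src_tags, alignments, tgt_len):
    """src_tags: BIO tag per English token, e.g. ["B-PER", "O", ...]
    alignments: list of (src_idx, tgt_idx) word-alignment pairs
    tgt_len: number of tokens in the Indic translation."""
    tgt_tags = ["O"] * tgt_len
    for s, t in alignments:
        if src_tags[s] != "O":
            tgt_tags[t] = src_tags[s]
    # Repair BIO consistency: an I-X with no preceding B-X/I-X becomes B-X.
    for i, tag in enumerate(tgt_tags):
        if tag.startswith("I-") and (i == 0 or tgt_tags[i - 1][2:] != tag[2:]):
            tgt_tags[i] = "B-" + tag[2:]
    return tgt_tags

# Example: "John lives in Delhi" aligned into a 5-token translation.
print(project_tags(["B-PER", "O", "O", "B-LOC"], [(0, 0), (3, 2)], 5))
# -> ['B-PER', 'O', 'B-LOC', 'O', 'O']
```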
IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages
India has a rich linguistic landscape with languages from 4 major language
families spoken by over a billion people. The 22 languages listed in the
Constitution of India (referred to as scheduled languages) are the focus of
this work. Given this linguistic diversity, high-quality and accessible Machine
Translation (MT) systems are essential in a country like India. Prior to this
work, there was (i) no parallel training data spanning all the 22 languages,
(ii) no robust benchmarks covering all these languages and containing content
relevant to India, and (iii) no existing translation models which support all
the 22 scheduled languages of India. In this work, we aim to address this gap
by focusing on the missing pieces required for enabling wide, easy, and open
access to good machine translation systems for all 22 scheduled Indian
languages. We identify four key areas of improvement: curating and creating
larger training datasets, creating diverse and high-quality benchmarks,
training multilingual models, and releasing models with open access. Our first
contribution is the release of the Bharat Parallel Corpus Collection (BPCC),
the largest publicly available parallel corpora for Indic languages. BPCC
contains a total of 230M bitext pairs, of which 126M were newly
added, including 644K manually translated sentence pairs created as part of
this work. Our second contribution is the release of the first n-way parallel
benchmark covering all 22 Indian languages, featuring diverse domains,
Indian-origin content, and source-original test sets. Next, we present
IndicTrans2, the first model to support all 22 languages, surpassing existing
models on multiple existing and new benchmarks created as part of this work.
Lastly, to promote accessibility and collaboration, we release our models and
associated data with permissive licenses at
https://github.com/ai4bharat/IndicTrans2.
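A minimal sketch of running a released IndicTrans2 checkpoint through the transformers API is shown below. The model id, the trust_remote_code requirement, and the omitted language-tag preprocessing are assumptions based on the public repository; consult https://github.com/ai4bharat/IndicTrans2 for the exact pipeline.

```python
# Hedged sketch of translating with an IndicTrans2 checkpoint. The model id
# and the FLORES-style language-tag preprocessing (e.g. "eng_Latn",
# "hin_Deva") are assumptions; the repo documents the real preprocessing.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "ai4bharat/indictrans2-en-indic-1B"  # assumed HF id
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id, trust_remote_code=True)

text = "India has 22 scheduled languages."
batch = tok(text, return_tensors="pt")       # repo inserts language tags here
out = model.generate(**batch, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```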
Samanantar: The Largest Publicly Available Parallel Corpora Collection for 11 Indic Languages
We present Samanantar, the largest publicly available parallel corpora
collection for Indic languages. The collection contains a total of 49.7 million
sentence pairs between English and 11 Indic languages (from two language
families). Specifically, we compile 12.4 million sentence pairs from existing,
publicly-available parallel corpora, and additionally mine 37.4 million
sentence pairs from the web, resulting in a 4x increase. We mine the parallel
sentences from the web by combining many corpora, tools, and methods: (a)
web-crawled monolingual corpora, (b) document OCR for extracting sentences from
scanned documents, (c) multilingual representation models for aligning
sentences, and (d) approximate nearest neighbor search for searching in a large
collection of sentences. Human evaluation of samples from the newly mined
corpora validates the high quality of the parallel sentences across the 11
languages. Further, we extract 83.4 million sentence pairs between all 55 Indic
language pairs from the English-centric parallel corpus using English as the
pivot language. We train multilingual NMT models spanning all these languages
on Samanantar, which outperform existing models and baselines on publicly
available benchmarks, such as FLORES, establishing the utility of Samanantar.
Our data and models are available publicly at
https://indicnlp.ai4bharat.org/samanantar/ and we hope they will help advance
research in NMT and multilingual NLP for Indic languages. (Accepted to the
Transactions of the Association for Computational Linguistics, TACL)
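Steps (c) and (d) of the mining pipeline can be sketched as follows: embed both sides with a multilingual sentence encoder and retrieve, for each English sentence, its nearest Indic neighbour from a nearest-neighbor index. LaBSE, FAISS, and the similarity threshold here are stand-ins, not necessarily the paper's exact choices.

```python
# Hedged sketch of the core of steps (c) and (d): embed sentences from two
# monolingual corpora with a multilingual encoder and pair each English
# sentence with its nearest Indic neighbour.
import faiss
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("sentence-transformers/LaBSE")

en_sents = ["The weather is nice today.", "Parallel data helps MT."]
hi_sents = ["आज मौसम अच्छा है।", "यह एक असंबंधित वाक्य है।"]

en_vecs = encoder.encode(en_sents, normalize_embeddings=True)
hi_vecs = encoder.encode(hi_sents, normalize_embeddings=True)

index = faiss.IndexFlatIP(hi_vecs.shape[1])  # inner product = cosine here
index.add(hi_vecs)
scores, ids = index.search(en_vecs, 1)       # nearest Indic sentence

for en, score, idx in zip(en_sents, scores[:, 0], ids[:, 0]):
    if score > 0.8:                          # assumed mining threshold
        print(en, "<->", hi_sents[idx])
```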
A survey in Adversarial Defences and Robustness in NLP
In recent years, deep neural networks have been shown to lack robustness and
to be vulnerable to adversarial perturbations of their input data. Strong
adversarial attacks have been proposed by various authors for computer vision
and Natural Language Processing (NLP) tasks. As a counter-effort, several
defense mechanisms have been proposed to keep these networks from failing.
Defending neural networks against adversarial attacks has its own importance:
the goal is to ensure that the model's prediction does not change when the
input is perturbed. Numerous methods for adversarial defense in NLP have been
proposed of late, for NLP tasks such as text classification, named entity
recognition, and natural language inference. Some of these methods are not
only used to defend neural networks against adversarial attacks, but also
serve as a regularization mechanism during training, saving the model from
overfitting. This survey reviews the methods proposed for adversarial defense
in NLP in recent years and organizes them under a novel taxonomy. It also
highlights the fragility of advanced deep neural networks in NLP and the
challenges involved in defending them.
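One defense family such surveys cover, adversarial training used as a regularizer, can be sketched as a single training step: perturb the input embeddings in the gradient direction (FGSM-style) and optimize the sum of the clean and adversarial losses. The function below is a generic illustration, not any specific surveyed method.

```python
# Hedged sketch of adversarial training as regularization: perturb the
# embedding layer in the loss-increasing direction and also train on the
# perturbed batch. `model` maps input embeddings to logits.
import torch

def adversarial_step(model, embeds, labels, loss_fn, epsilon=1e-2):
    """embeds: input embeddings with requires_grad=True."""
    loss = loss_fn(model(embeds), labels)
    grad, = torch.autograd.grad(loss, embeds)
    # Move each embedding a small step in the gradient's sign direction.
    adv_embeds = embeds + epsilon * grad.sign()
    adv_loss = loss_fn(model(adv_embeds.detach()), labels)
    return loss + adv_loss  # clean + adversarial objective, backprop this
```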
Vārta: A Large-Scale Headline-Generation Dataset for Indic Languages
We present Vārta, a large-scale multilingual dataset for headline
generation in Indic languages. This dataset includes 41.8 million news articles
in 14 different Indic languages (and English), which come from a variety of
high-quality sources. To the best of our knowledge, this is the largest
collection of curated articles for Indic languages currently available. We use
the collected data in a series of experiments to answer important questions
related to Indic NLP and multilinguality research in general. We show that the
dataset is challenging even for state-of-the-art abstractive models and that
they perform only slightly better than extractive baselines. Owing to its size,
we also show that the dataset can be used to pretrain strong language models
that outperform competitive baselines on both NLU and NLG benchmarks.
(Findings of ACL 2023)
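A lead-1 heuristic is the simplest extractive baseline of the kind abstractive models are compared against: predict the article's first sentence as the headline. The naive sentence splitter below is illustrative only.

```python
# Hedged sketch of a lead-1 extractive headline baseline with a
# deliberately naive sentence splitter.
import re

def lead1_headline(article: str) -> str:
    """Return the first sentence as the headline prediction."""
    sentences = re.split(r"(?<=[.!?।])\s+", article.strip())  # '।' = danda
    return sentences[0] if sentences else ""

print(lead1_headline("भारत ने नया उपग्रह लॉन्च किया। यह मिशन सफल रहा।"))
# -> "भारत ने नया उपग्रह लॉन्च किया।"
```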
Towards Building ASR Systems for the Next Billion Users
Recent methods in speech and language technology pretrain very large models
which are fine-tuned for specific tasks. However, the benefits of such large
models are often limited to a few resource-rich languages of the world. In this
work, we make multiple contributions towards building ASR systems for
low-resource languages from the Indian subcontinent. First, we curate 17,000 hours
of raw speech data for 40 Indian languages from a wide variety of domains
including education, news, technology, and finance. Second, using this raw
speech data we pretrain several variants of wav2vec style models for 40 Indian
languages. Third, we analyze the pretrained models to find key features:
codebook vectors of similar sounding phonemes are shared across languages,
representations across layers are discriminative of the language family, and
attention heads often pay attention within small local windows. Fourth, we
fine-tune this model for downstream ASR for 9 languages and obtain
state-of-the-art results on 3 public datasets, including on very low-resource
languages such as Sinhala and Nepali. Our work establishes that multilingual
pretraining is an effective strategy for building ASR systems for the
linguistically diverse speakers of the Indian subcontinent. Our code, data and
models are available publicly at https://indicnlp.ai4bharat.org/indicwav2vec/
and we hope they will help advance research in ASR for Indic languages.
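Inference with a fine-tuned wav2vec-style ASR model reduces to encoding the waveform and greedily decoding the CTC logits, as sketched below. The checkpoint id is an assumption for illustration; see the linked page for the actual released models.

```python
# Hedged sketch of ASR inference with a fine-tuned wav2vec-style model and
# greedy CTC decoding. The checkpoint id is an assumption.
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

model_id = "ai4bharat/indicwav2vec-hindi"  # assumed HF id
processor = Wav2Vec2Processor.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id)

def transcribe(waveform, sample_rate=16_000):
    """waveform: 1-D float array of mono audio at 16 kHz."""
    inputs = processor(waveform, sampling_rate=sample_rate, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    pred_ids = logits.argmax(dim=-1)  # greedy CTC decoding
    return processor.batch_decode(pred_ids)[0]
```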